Data analysis

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on application of statistical or structural models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.

Type of data

Data can be of several types

The process of data analysis

Data analysis is a process, within which several phases can be distinguished:[1]

Data cleaning

Data cleaning is an important procedure during which the data are inspected and erroneous data are corrected where this is necessary, preferable, and possible. Data cleaning can be done during the stage of data entry. If this is done, it is important that no subjective decisions are made. The guiding principle provided by Adèr (ref) is: during subsequent manipulations of the data, information should always be cumulatively retrievable. In other words, it should always be possible to undo any alteration to the data set. Therefore, it is important not to throw information away at any stage in the data cleaning phase. All information should be saved (i.e., when altering variables, both the original values and the new values should be kept, either in a duplicate data set or under a different variable name), and all alterations to the data set should be carefully and clearly documented, for instance in a syntax file or a log.[2]
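The principle of cumulatively retrievable information can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the cited source: the `CleaningLog` class, the `clean_age` function, the `age_clean` variable name, the plausibility range, and the example records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CleaningLog:
    """Records every alteration so that it can be reviewed or undone later."""
    entries: list = field(default_factory=list)

    def record(self, row, variable, old, new, reason):
        self.entries.append(
            {"row": row, "variable": variable, "old": old, "new": new, "reason": reason}
        )


def clean_age(records, log, max_age=120):
    """Add a cleaned copy of 'age' under a new name; the original value is kept."""
    cleaned = []
    for i, rec in enumerate(records):
        new_rec = dict(rec)                  # copy, so the input is never modified
        age = rec["age"]
        if age is not None and not (0 <= age <= max_age):
            new_rec["age_clean"] = None      # mark as missing rather than guess a value
            log.record(i, "age", age, None, f"outside plausible range 0-{max_age}")
        else:
            new_rec["age_clean"] = age
        cleaned.append(new_rec)
    return cleaned


# Hypothetical records; 999 stands in for an obvious data entry error.
raw = [{"id": 1, "age": 34}, {"id": 2, "age": 999}, {"id": 3, "age": None}]
log = CleaningLog()
cleaned = clean_age(raw, log)
```

Because the corrected value is stored under a different variable name and each change is logged with its old value, any alteration can be traced and reversed.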

Initial data analysis

The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analyses aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:[3]

Quality of data

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analyses: frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency histograms, normal probability plots), associations (correlations, scatter plots).
Other initial data quality checks are:

The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.[4]
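Several of the checks named in this section (frequency counts, mean, standard deviation, median, skewness, kurtosis) can be computed with the Python standard library alone. The sketch below is illustrative: the `describe` function and the example scores are hypothetical, and skewness and kurtosis are computed as simple moment-based estimates, which is only one of several conventions in use.

```python
import statistics
from collections import Counter


def describe(values):
    """Return basic descriptive statistics plus moment-based skewness and excess kurtosis."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sd_pop = statistics.pstdev(values)       # population SD, used for the moment ratios
    if sd_pop == 0:
        raise ValueError("all values are identical; skewness is undefined")
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "sd": statistics.stdev(values),      # sample SD (n - 1 in the denominator)
        "skewness": m3 / sd_pop ** 3,
        "excess_kurtosis": m4 / sd_pop ** 4 - 3,
    }


# Hypothetical scores with one extreme value.
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 20]
summary = describe(scores)
counts = Counter(scores)                     # frequency count per distinct value
```

In this made-up example the single extreme score pulls the mean above the median and produces positive skewness, which is the kind of pattern such a check is meant to surface before the main analysis.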

Quality of measurements

The quality of the measurement instruments should be checked during the initial data analysis phase only when this is not the focus or research question of the study. One should check whether the structure of the measurement instruments corresponds to the structure reported in the literature.
There are two ways to assess measurement quality:

Initial transformations

After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.[6]
Possible transformations of variables are:[7]
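The specific transformations are not listed here. Purely as an illustration, the sketch below applies square-root and logarithmic transformations, which are commonly used for positively skewed variables; choosing these two is an assumption of this example and is not taken from the cited source. The function names and the `shift` parameter are hypothetical.

```python
import math


def sqrt_transform(values):
    """Square-root transformation; requires non-negative values."""
    if any(x < 0 for x in values):
        raise ValueError("square root is undefined for negative values")
    return [math.sqrt(x) for x in values]


def log_transform(values, shift=1.0):
    """Base-10 log transformation; 'shift' is added first so that zeros are allowed."""
    if any(x + shift <= 0 for x in values):
        raise ValueError("all shifted values must be positive")
    return [math.log10(x + shift) for x in values]


# Hypothetical positively skewed variable.
income = [0, 4, 9, 99, 999]

# Store the transformed values under new names, leaving the original untouched.
dataset = {
    "income": income,
    "income_sqrt": sqrt_transform(income),
    "income_log": log_transform(income),
}
```

Storing the transformed values under new variable names keeps the original values available, in line with the data cleaning principle described earlier.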

Did the implementation of the study fulfill the intentions of the research design?

One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups.
If the study did not need or did not use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample.
Other possible data distortions that should be checked are:
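Returning to the randomization check described in this section, one simple way to see whether a background variable is similarly distributed across two groups is the standardized mean difference. The sketch below is an illustration under stated assumptions: the function name, the example ages, and the 0.1 cut-off (a common rule of thumb, not a value from the cited source) are all choices made for this example.

```python
import statistics


def standardized_mean_difference(group_a, group_b):
    """Difference in group means divided by the pooled standard deviation."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = ((var_a + var_b) / 2) ** 0.5
    if pooled_sd == 0:
        raise ValueError("both groups have zero variance")
    return (mean_a - mean_b) / pooled_sd


# Hypothetical ages in a treatment group and a control group.
treatment_age = [30, 35, 40, 45, 50]
control_age = [31, 36, 39, 46, 49]

smd = standardized_mean_difference(treatment_age, control_age)
balanced = abs(smd) < 0.1        # assumed rule-of-thumb threshold
```

The same check would be repeated for each background and substantive variable. A large absolute value for any of them suggests that randomization did not balance the groups on that variable.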

Characteristics of data sample

In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase.
The characteristics of the data sample can be assessed by looking at:

Final stage of the initial data analysis

During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken.
Also, the original plan for the main data analyses can and should be specified in more detail and/or rewritten.
In order to do this, several decisions about the main data analyses can and should be made:

Analyses

Several analyses can be used during the initial data analysis phase:[11]

It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level:[12]

Main data analysis

In the main analysis phase, analyses aimed at answering the research question are performed, as well as any other relevant analyses needed to write the first draft of the research report.[13]

Exploratory and confirmatory approaches

In the main analysis phase either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis clear hypotheses about the data are tested.

Exploratory data analysis should be interpreted carefully. When testing multiple models at once, there is a high chance of finding at least one of them to be significant, but this can be due to a type I error. It is important to always adjust the significance level when testing multiple models, for example with a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, not to test that theory as well. When a model is found through exploratory analysis of a dataset, following it up with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same type I error that produced the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.[14]
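The effect of multiple testing and of the Bonferroni correction can be made concrete with a short Python sketch. The function names and the p-values are hypothetical, and the family-wise error rate formula used here assumes that the tests are independent.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one type I error across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-adjusted level alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values from four models tested on the same data.
p_values = [0.003, 0.02, 0.04, 0.30]

fwer = familywise_error_rate(0.05, len(p_values))   # about 0.185 without correction
significant = bonferroni(p_values)                  # adjusted threshold: 0.05 / 4 = 0.0125
```

Without correction, three of the four hypothetical p-values fall below 0.05. After the adjustment, only the first remains significant.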

Stability of results

It is important to obtain some indication about how generalizable the results are.[15] While this is hard to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing this:

Statistical methods

Many statistical methods have been used for statistical analyses. A very brief list of four of the more popular methods is:

Free software for data analysis

Nuclear and particle physics

In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system. The data are then processed, in a step usually called data reduction, to apply calibrations and to extract physically significant information. Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad hoc. The resulting data n-tuples are then scrutinized by the physicists, using specialized software tools like ROOT or PAW, comparing the results of the experiment with theory.

The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for Monte Carlo simulation software like Geant4 to predict the response of the detector to a given theoretical event, producing simulated events that are then compared to experimental data.

See also

References

  1. ^ Adèr, 2008, p. 334-335.
  2. ^ Adèr, 2008, p. 336-337.
  3. ^ Adèr, 2008, p. 337.
  4. ^ Adèr, 2008, p. 338-341.
  5. ^ Adèr, 2008, p. 341-342.
  6. ^ Adèr, 2008, p. 344.
  7. ^ Tabachnick & Fidell, 2007, p. 87-88.
  8. ^ Adèr, 2008, p. 344-345.
  9. ^ Adèr, 2008, p. 345.
  10. ^ Adèr, 2008, p. 345-346.
  11. ^ Adèr, 2008, p. 346-347.
  12. ^ Adèr, 2008, p. 349-353.
  13. ^ Adèr, 2008, p. 363.
  14. ^ Adèr, 2008, p. 361-362.
  15. ^ Adèr, 2008, p. 368-371.
  16. ^ Zeptoscope.synopsia.net

Further reading